This project is for the course Introduction to Open Data Science at the University of Helsinki, fall 2018. The learning objectives of the course are to understand the possibilities of open science, open data, and open research tools; to visualize and analyze data sets with some fairly advanced statistical methods; to master a few state-of-the-art, freely available open software tools; to apply the principles of reproducible research and see their practical advantages; and to keep learning about all of this in the near and distant future (“life-long learning”). My GitHub repository can be found at: https://github.com/ellitaimela/IODS-project
The data analysis exercise for the second course week consisted of analysing students’ learning data from the course Introduction to Social Statistics, collected between December 2014 and January 2015. The exercise consisted of reading and exploring the structure of a provided dataset, showing a graphical overview and summaries of the data, and finally creating and analysing a regression model from the data.
I retrieved the data provided for this exercise from AWS. The data, formatted as a .txt file, described students’ learning outcomes in the course Introduction to Social Statistics held in fall 2014, and included 166 observations of 7 variables. The variables represented the students’ gender, age, and attitude towards the course; their scores on exam/exercise questions related to deep learning, strategic learning, and surface learning; and the exam points granted to the students.
# Reading the data
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
sep=",", header=TRUE)
# The data includes students' learning data from the course Introduction to Social Statistics,
# collected during 3.12.2014 - 10.1.2015
# Exploring the structure and dimensions of the data: 166 observations of 7 variables
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(lrn14)
## [1] 166 7
# Summaries of the variables
summary(lrn14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
I created a graphical overview of the distributions of the variables and their relationships. First, I drew a boxplot of the whole dataset to compare the distributions of the variables. Second, I created a plot matrix with ggpairs to get an overall picture of the relationships between the variables. After this, I examined the distributions in more detail: I showed the distribution of students by gender with a barplot, and for the rest of the variables, I examined the distributions with histograms. I also analyzed the pairwise relationships of the variables with scatter plots (qplot) and boxplots (plot).
library(ggplot2)
library(GGally)
# Plotting the distributions of the variables
boxplot(lrn14)
# Creating a plot matrix with ggpairs
ggpairs(lrn14, lower = list(combo = wrap("facethist", bins = 20)))
# 110 females, 56 males
barplot(table(lrn14$gender))
# Age distribution from 17 to 55, median 22, mean 25.51, standard deviation 7.766078
hist(lrn14$age)
# Attitude from 1 to 5, median 3.2, mean 3.14, standard deviation 0.5327656
hist(lrn14$attitude)
# Points of deep learning questions from 1 to 5, median 3.667, mean 3.680, sd 0.5541369
hist(lrn14$deep)
# Points of strategic learning questions from 1 to 5, median 3.188, mean 3.121, sd 0.7718318
hist(lrn14$stra)
# Points of surface learning questions from 1 to 5, median 2.833, mean 2.787, sd 0.5288405
hist(lrn14$surf)
# Points granted altogether from 7 to 33, median 23.00, mean 22.72, sd 5.894884
hist(lrn14$points)
# A positive correlation between attitude and points can be found
qplot(attitude, points, data = lrn14) + geom_smooth(method = "lm")
qplot(age, points, data = lrn14) + geom_smooth(method = "lm")
qplot(surf, points, data = lrn14) + geom_smooth(method = "lm")
# There is no correlation between age and attitude
qplot(age, attitude, data = lrn14) + geom_smooth(method = "lm")
# The correlation between age and points achieved in strategic and deep questions looks minimal
qplot(age, stra, data = lrn14) + geom_smooth(method = "lm")
qplot(age, deep, data = lrn14) + geom_smooth(method = "lm")
# There can be seen some negative correlation between age and points achieved in surface questions - statistical significance not analyzed
qplot(age, surf, data = lrn14) + geom_smooth(method = "lm")
# On average, men achieved marginally higher points compared to women - statistical significance not analyzed
plot(lrn14$points ~ lrn14$gender)
# On average, women achieved higher points in strategic and surface learning questions - statistical significance not analyzed
plot(lrn14$stra ~ lrn14$gender)
plot(lrn14$deep ~ lrn14$gender)
plot(lrn14$surf ~ lrn14$gender)
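The standard deviations quoted in the comments above do not come from summary(); a minimal sketch of how they can be computed for the numeric columns (assuming the learning2014 URL used above is still reachable):

```r
# Read the same dataset as above
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
                    sep = ",", header = TRUE)
# Standard deviation of each numeric variable (gender is a factor and is skipped)
num_cols <- sapply(lrn14, is.numeric)
sapply(lrn14[, num_cols], sd)
```

This should reproduce, e.g., the sd of age (7.766078) and of points (5.894884) quoted in the comments.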
I started building the regression model by choosing three variables - attitude, stra, and surf - as explanatory variables and fitting a regression model with exam points as the dependent variable. I chose these variables because they had the highest correlations with points, as could be seen in the ggpairs matrix.
# Fitting a linear model
model1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model1)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Of the explanatory variables, only attitude turned out statistically significant, with a p-value of 1.93e-08. The p-values of stra and surf were not statistically significant, so I excluded them from further analysis. I eventually tested combinations of all variables in the dataset, but found no statistically significant explanatory variable other than attitude. Therefore, I created a new regression model with attitude as the only explanatory variable. Statistically significant (p<0.0001) estimates were found for both the intercept and attitude.
# Fitting a new model with attitude as the only explanatory variable
model2 <- lm(points ~ attitude, data = lrn14)
summary(model2)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
As can be seen in the summary of the regression model above (model2), exam points are positively associated with one’s attitude. This is intuitive, because a positive attitude towards a specific topic typically means higher interest and motivation to learn, which typically leads one to invest more time and effort in the topic. From the summary, we can see that a one-unit increase in attitude (on a scale from 1 to 5) increases points by 3.5255 on average.
Multiple R-squared describes how close the data are to the fitted regression line of the model. R-squared takes values from 0 to 1 (or 0 to 100 %) and gives the proportion of the variation in the target variable that the model explains. The multiple R-squared of model2 is 0.1906, which means that attitude explains only about 19 % of the variation in exam points.
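For a model with a single predictor, multiple R-squared is simply the squared Pearson correlation between the predictor and the response; a small self-contained check of this identity (re-reading the data, assuming the URL above is still reachable):

```r
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
                    sep = ",", header = TRUE)
model2 <- lm(points ~ attitude, data = lrn14)
# Multiple R-squared reported by summary() ...
summary(model2)$r.squared
# ... equals the squared correlation between attitude and points
cor(lrn14$attitude, lrn14$points)^2
```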
One natural assumption of the linear regression model created above is linearity. Another assumption related to linear regression models is that errors are normally distributed, they are not correlated, and have a constant variance. To analyze whether these assumptions are actually valid, we can take a closer look at the residuals of the model.
The normality of the errors can be analyzed with a Q-Q plot. As can be seen, a clear majority of the residuals follow the line; only the residuals at very low and very high values deviate from it. Therefore we can assume the error terms to be approximately normally distributed.
By plotting the residuals towards the fitted values of the model, we can see if there is any pattern that tells how the residuals change and deduce whether the variance of the residuals is constant. As we look at the Residuals vs. Fitted plot below, there does not seem to be a clear pattern. Therefore we can consider the assumption of a constant variance somewhat valid.
By examining the Residuals vs. Leverage plot below, we can also analyze the impact of single observations on the model. As we can see, no single observation stands out clearly, and the leverage values of all observations are low. Therefore we can conclude that no single observation has an excessive impact on the model.
par(mfrow = c(2,2))
# Visualizing Residuals vs. Fitted values
plot(model2, which = 1)
# Visualizing a normal QQ-plot
plot(model2, which = 2)
# Visualizing Residuals vs. Leverage
plot(model2, which = 5)
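Alongside the Q-Q plot, the normality of the residuals could also be checked numerically, for example with a Shapiro-Wilk test; this was not part of the original analysis, just a sketch (re-reading the data so the chunk stands alone, assuming the same URL):

```r
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",
                    sep = ",", header = TRUE)
model2 <- lm(points ~ attitude, data = lrn14)
# Shapiro-Wilk test: a small p-value would indicate non-normal residuals
shapiro.test(residuals(model2))
```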
The data analysis exercise for the third course week consisted of analysing the relationship between students’ alcohol consumption and learning outcomes.
A new markdown file ‘chapter3.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.
I read the data formulated in the Data Wrangling part of this week’s exercise, and checked the structure, dimensions and summary of the dataset. The data describes student achievement in secondary education at two Portuguese schools. The data attributes include student grades and demographic, social and school-related features, and the data was collected using school reports and questionnaires. Two datasets were provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
library(tidyr)
# Setting the working directory
setwd("/Users/ellitaimela/Documents/Github/IODS-project/Data")
# Reading the data
alc <- read.csv("alc.csv", sep=",")
# This data describes student achievement in secondary education at two Portuguese schools.
# The data attributes include student grades and demographic, social and school-related features,
# and it was collected by using school reports and questionnaires. Two datasets are provided
# regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).
# Checking the data, and the structure, dimensions and summary of the dataset
# alc
str(alc)
## 'data.frame': 382 obs. of 36 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 2 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ G1 : int 2 7 10 14 8 14 12 8 16 13 ...
## $ G2 : int 8 8 10 14 12 14 12 9 17 14 ...
## $ G3 : int 8 8 11 14 12 14 12 10 18 14 ...
## $ alc_use : num 1 1 2.5 1 1.5 1.5 1 1 1 1 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382 36
summary(alc)
## X school sex age address famsize
## Min. : 1.00 GP:342 F:198 Min. :15.00 R: 81 GT3:278
## 1st Qu.: 96.25 MS: 40 M:184 1st Qu.:16.00 U:301 LE3:104
## Median :191.50 Median :17.00
## Mean :191.50 Mean :16.59
## 3rd Qu.:286.75 3rd Qu.:17.00
## Max. :382.00 Max. :22.00
## Pstatus Medu Fedu Mjob Fjob
## A: 38 Min. :0.000 Min. :0.000 at_home : 53 at_home : 16
## T:344 1st Qu.:2.000 1st Qu.:2.000 health : 33 health : 17
## Median :3.000 Median :3.000 other :138 other :211
## Mean :2.806 Mean :2.565 services: 96 services:107
## 3rd Qu.:4.000 3rd Qu.:4.000 teacher : 62 teacher : 31
## Max. :4.000 Max. :4.000
## reason nursery internet guardian traveltime
## course :140 no : 72 no : 58 father: 91 Min. :1.000
## home :110 yes:310 yes:324 mother:275 1st Qu.:1.000
## other : 34 other : 16 Median :1.000
## reputation: 98 Mean :1.448
## 3rd Qu.:2.000
## Max. :4.000
## studytime failures schoolsup famsup paid activities
## Min. :1.000 Min. :0.0000 no :331 no :144 no :205 no :181
## 1st Qu.:1.000 1st Qu.:0.0000 yes: 51 yes:238 yes:177 yes:201
## Median :2.000 Median :0.0000
## Mean :2.037 Mean :0.2016
## 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :3.0000
## higher romantic famrel freetime goout
## no : 18 no :261 Min. :1.000 Min. :1.00 Min. :1.000
## yes:364 yes:121 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :4.000 Median :3.00 Median :3.000
## Mean :3.937 Mean :3.22 Mean :3.113
## 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health absences
## Min. :1.000 Min. :1.000 Min. :1.000 Min. : 0.0
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 1.0
## Median :1.000 Median :2.000 Median :4.000 Median : 3.0
## Mean :1.482 Mean :2.296 Mean :3.573 Mean : 4.5
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 6.0
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :45.0
## G1 G2 G3 alc_use
## Min. : 2.00 Min. : 4.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:1.000
## Median :12.00 Median :12.00 Median :12.00 Median :1.500
## Mean :11.49 Mean :11.47 Mean :11.46 Mean :1.889
## 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :18.00 Max. :18.00 Max. :18.00 Max. :5.000
## high_use
## Mode :logical
## FALSE:268
## TRUE :114
##
##
##
I chose 4 variables to study their relationship with high/low alcohol consumption. The variables I chose were famrel (quality of family relationships), goout (going out with friends), absences (number of school absences), and G3 (final grade). My initial hypothesis was that students who have worse relationships with their family tend to feel worse and consume more alcohol. I also hypothesized that going out with friends and the number of school absences correlate positively with alcohol consumption, and that the final grade correlates negatively with it.
I drew boxplots of the chosen variables’ distributions, divided into high and low alcohol consumption. As can be seen, the results are in line with my initial hypotheses. The average grades are higher for students who consume less alcohol, and students who have better relationships with their family members consume less alcohol. Students who go out with friends more or have more absences also consume more alcohol. The numerical differences of the group means of the variables can also be seen below. However, statistical significance cannot be deduced from these results.
# initialize a plot of high_use and family relationships
g1 <- ggplot(alc, aes(x = high_use, y = famrel))
# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("family relationship")
# initialize a plot of high_use and going out
g2 <- ggplot(alc, aes(x = high_use, y = goout))
# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("going out")
# initialize a plot of high_use and absences
g3 <- ggplot(alc, aes(x = high_use, y = absences))
# define the plot as a boxplot and draw it
g3 + geom_boxplot() + ylab("absences")
# initialize a plot of high_use and G3
g4 <- ggplot(alc, aes(x = high_use, y = G3))
# define the plot as a boxplot and draw it
g4 + geom_boxplot() + ylab("grade")
# produce summary statistics by group
alc %>% group_by(high_use) %>% summarise(count = n(), famrel = mean(famrel), goout = mean(goout), absences = mean(absences), mean_grade = mean(G3))
## # A tibble: 2 x 6
## high_use count famrel goout absences mean_grade
## <lgl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 FALSE 268 4.00 2.85 3.71 11.7
## 2 TRUE 114 3.78 3.72 6.37 10.8
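Since, as noted above, statistical significance cannot be deduced from the boxplots and group means alone, a Welch two-sample t-test could be used as a quick check of the group differences (a sketch; assumes the wrangled alc.csv is in the working directory, as in the chunk above):

```r
# Read the wrangled data, as above
alc <- read.csv("alc.csv", sep = ",")
# Welch two-sample t-tests of the group differences seen in the boxplots
t.test(absences ~ high_use, data = alc)
t.test(G3 ~ high_use, data = alc)
```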
I created a logistic regression model m with the boolean variable high_use as the target variable; high_use is TRUE if a student’s alcohol consumption is high. The explanatory variables are the previously chosen famrel, goout, absences, and G3. According to the results (see the output of summary(m)), going out and the number of absences are positively associated with high alcohol consumption at a high significance level (p<0.001). Family relationships (1 = bad, 5 = good) are negatively associated with alcohol consumption (p<0.05). The relationship between alcohol consumption and the final grade G3 was not statistically significant (p = 0.24).
I combined and printed out the odds ratios and their confidence intervals. As can be seen from the printout below, the predictor goout has the widest confidence interval. The interval of G3 contains the value 1, with an odds ratio of 0.95, so it is hard to determine whether G3 is positively or negatively associated with high_use. The odds ratios and confidence intervals indicate that goout is most strongly (positively) associated with high_use. Famrel is negatively associated with high_use (the whole confidence interval is below 1). Absences are positively associated with high_use, but only slightly, since the confidence interval lies between 1.032 and 1.125.
# find the model with glm()
m <- glm(high_use ~ famrel + goout + absences + G3, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ famrel + goout + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9389 -0.7743 -0.5392 0.9022 2.3966
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.83044 0.78397 -2.335 0.019552 *
## famrel -0.34483 0.13563 -2.542 0.011010 *
## goout 0.75149 0.12033 6.245 4.23e-10 ***
## absences 0.07261 0.02174 3.340 0.000839 ***
## G3 -0.04482 0.03843 -1.166 0.243529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 394.43 on 377 degrees of freedom
## AIC: 404.43
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) famrel goout absences G3
## -1.83043900 -0.34482682 0.75148543 0.07261377 -0.04481705
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- exp(confint(m))
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1603432 0.0331851 0.724006
## famrel 0.7083430 0.5415610 0.923423
## goout 2.1201470 1.6848948 2.703401
## absences 1.0753151 1.0316028 1.124827
## G3 0.9561724 0.8866759 1.031252
I refined my model and excluded the final grade G3 from it because the relationship with alcohol consumption was not statistically significant.
As can be seen below from the 2x2 cross tabulation, 69 of the (244 + 69 =) 313 students who were predicted as low users were actually high users of alcohol, and 24 of the (24 + 45 =) 69 students who were predicted as high users were actually low users. Therefore (69+24)/(313+69) = 0.2435, i.e. about 24.3 percent of the individuals were classified inaccurately. The same result can be obtained by building a loss function similar to the one in the DataCamp exercise. As can be seen, the prediction power is not perfect, but it is higher than that of a simple random guessing strategy, where the probability of guessing right between two options would be only 50 percent.
# find the model with glm()
m2 <- glm(high_use ~ famrel + goout + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m2)
##
## Call:
## glm(formula = high_use ~ famrel + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9018 -0.7731 -0.5466 0.9002 2.4180
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.38825 0.63161 -3.781 0.000156 ***
## famrel -0.34986 0.13534 -2.585 0.009735 **
## goout 0.77071 0.11981 6.433 1.25e-10 ***
## absences 0.07446 0.02174 3.425 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 395.79 on 378 degrees of freedom
## AIC: 403.79
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m2)
## (Intercept) famrel goout absences
## -2.38824753 -0.34985642 0.77071403 0.07445633
# predict() the probability of high_use
probabilities <- predict(m2, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 244 24
## TRUE 69 45
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63874346 0.06282723 0.70157068
## TRUE 0.18062827 0.11780105 0.29842932
## Sum 0.81937173 0.18062827 1.00000000
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2434555
According to a 10-fold cross-validation of the model m2, the prediction error on the test folds is around 0.23 to 0.25, which is smaller than the error of the model introduced in DataCamp (circa 0.26). The code and results can be seen below.
library(boot)
# K-fold cross-validation
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486911
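Because cv.glm() draws the 10 folds at random, the error estimate varies slightly between runs, which explains the 0.23-0.25 range quoted above; a sketch of repeating the cross-validation (assumes alc, m2 and loss_func as defined in the chunks above):

```r
library(boot)
# Repeat 10-fold cross-validation a few times; the folds are random,
# so the estimated prediction error varies slightly from run to run
set.seed(123)
errors <- replicate(5, cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)$delta[1])
errors
mean(errors)
```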
The data analysis exercise for the fourth course week consisted of analysing the relationship between xxxx.
A new markdown file ‘chapter4.Rmd’ was created and included as a child file in the ‘index.Rmd’ file.
I read the Boston data that includes housing values in the suburbs of Boston. Columns include data of for example per capita crime rate, proportion of non-retail business acres, and weighted mean of distances to five Boston employment centres by town. The dimensions of the data are 506 x 14.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(GGally)
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.4.2 v purrr 0.2.5
## v readr 1.1.1 v stringr 1.3.1
## v tibble 1.4.2 v forcats 0.3.0
## -- Conflicts ------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
# Loading the data
data(Boston)
# Checking the data, and the structure, dimensions and summary of the dataset
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
I created a graphical overview of the data. As can be seen from the ggpairs matrix and the summary statistics, the distributions vary between the variables. Crim, for example, resembles a Poisson distribution, and rm a normal distribution. Many of the distributions are left- or right-tailed.
The correlations between the variables seem logical. For example, the crime rate, the proportion of nearby industrial areas, the nitrogen oxide concentration, the high age of the buildings, the accessibility to radial highways, the property tax rate, the pupil-teacher ratio, and the lower status of the population all correlate negatively with the median value of occupied homes within a town. The Charles River dummy variable has the smallest absolute correlations with the other variables.
# Graphical overview of the data
#pairs(Boston)
#ggpairs(Boston, lower = list(combo = wrap("facethist", bins = 20)))
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
# Summaries of the variables
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
I standardized the dataset with the scale function. After scaling, the mean of each variable is 0 and its standard deviation 1, while the shape of each distribution is preserved. I also created a categorical variable of the crime rate in the Boston dataset from the scaled crime rate, using the quantiles as the break points of the categorical variable. The old crime rate variable was then dropped from the dataset. Finally, I divided the dataset into train and test sets so that 80 percent of the data belongs to the train set and 20 percent to the test set.
boston_scaled <- scale(Boston)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
# summary of the scaled dataset
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
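The claim that scaling centers each variable at zero with unit standard deviation can be verified directly. A minimal sanity check, assuming `boston_scaled` has been created as above:

```r
# every column of the scaled data should have mean ~0 (up to floating point)
# and standard deviation exactly 1
round(colMeans(boston_scaled), 10)
apply(boston_scaled, 2, sd)
```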
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
label_vector <- c("low", "med_low", "med_high", "high")
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = label_vector)
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove the original crime rate variable crim and add the categorical crime
boston_scaled <- subset(boston_scaled, select = -c(crim))
boston_scaled$crime <- crime
str(boston_scaled)
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
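One caveat: `sample()` draws a different random split on every run, so the exact numbers in the results below will vary between knits. Setting a seed before sampling makes the split, and hence the whole analysis, reproducible. A hedged sketch; the seed value 123 is an arbitrary choice, not part of the original analysis:

```r
# fix the random number generator so the 80/20 split is reproducible
set.seed(123)  # 123 is an arbitrary example seed
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
```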
I fit a linear discriminant analysis (LDA) model on the train set, using the categorical crime rate as the target variable and all the other variables in the dataset as predictors. After this, I drew the LDA biplot.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2524752 0.2524752 0.2475248 0.2475248
##
## Group means:
## zn indus chas nox rm
## low 1.01430032 -0.8595016 -0.07933396 -0.8855684 0.43116518
## med_low -0.06498447 -0.3190847 -0.07933396 -0.5715963 -0.12437682
## med_high -0.37404455 0.2102693 0.23949396 0.4452535 0.03974513
## high -0.48724019 1.0149946 -0.11484506 1.0174086 -0.42214226
## age dis rad tax ptratio
## low -0.9037874 0.9071554 -0.6756133 -0.7274356 -0.4295921
## med_low -0.3463170 0.3937214 -0.5506331 -0.5067360 -0.1026355
## med_high 0.4160772 -0.3945964 -0.3973015 -0.2832442 -0.3448279
## high 0.7909058 -0.8476113 1.6596029 1.5294129 0.8057784
## black lstat medv
## low 0.37442550 -0.76562524 0.50309288
## med_low 0.32920358 -0.15117274 0.00474761
## med_high 0.05611672 0.06031831 0.13365008
## high -0.77339653 0.88710015 -0.71096361
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08343713 0.71736209 -0.982867727
## indus -0.02722484 -0.03443326 0.168804810
## chas -0.06357856 -0.06780957 -0.008971079
## nox 0.24242165 -0.79918647 -1.311766111
## rm -0.08510252 -0.04515487 -0.157472512
## age 0.32301982 -0.29943434 -0.024226948
## dis -0.14646102 -0.22037360 0.174348587
## rad 3.15798912 1.10614878 0.036631226
## tax 0.07736913 -0.27389242 0.429227307
## ptratio 0.13038523 0.05437343 -0.297659718
## black -0.16656743 0.02701712 0.142104435
## lstat 0.18752384 -0.30657671 0.332505471
## medv 0.17172326 -0.46329524 -0.174932456
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9473 0.0412 0.0115
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen=2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 2)
I saved the correct crime categories from the test set and removed the categorical crime variable from the test data. After this, I predicted the classes on the test data with the LDA model. Cross-tabulating the predictions against the correct categories shows that the model predicts the high crime rate category very accurately, while the lower categories are more often confused with each other.
# save the correct classes from test data
correct_classes <- test$crime
test <- subset(test, select = -c(crime))
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
str(lda.pred)
## List of 3
## $ class : Factor w/ 4 levels "low","med_low",..: 1 2 2 3 2 2 2 2 2 2 ...
## $ posterior: num [1:102, 1:4] 0.6474 0.3931 0.0625 0.0797 0.0859 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:102] "1" "17" "25" "29" ...
## .. ..$ : chr [1:4] "low" "med_low" "med_high" "high"
## $ x : num [1:102, 1:3] -3.64 -2.61 -1.76 -1.86 -1.77 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:102] "1" "17" "25" "29" ...
## .. ..$ : chr [1:3] "LD1" "LD2" "LD3"
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 12 1 0
## med_low 4 13 7 0
## med_high 1 7 18 0
## high 0 0 1 26
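The overall accuracy can be read off the cross-tabulation, since the correct predictions sit on the diagonal. A minimal sketch, with the confusion matrix re-entered by hand from the output above (the exact counts depend on the random train/test split):

```r
# confusion matrix copied from the cross-tabulation above
conf <- matrix(c(12, 12,  1,  0,
                  4, 13,  7,  0,
                  1,  7, 18,  0,
                  0,  0,  1, 26),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
# proportion of test observations classified correctly
sum(diag(conf)) / sum(conf)   # 69/102, about 0.68
```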
I reloaded the Boston dataset, standardized it, calculated the Euclidean and Manhattan distances between the observations, and ran the k-means algorithm on the scaled data with three clusters. Based on the pairs plots, three clusters appear to separate the observations reasonably well.
# load MASS and Boston
library(MASS)
data('Boston')
boston_scaled.new <- scale(Boston)
# euclidean distance matrix
dist_eu <- dist(boston_scaled.new)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled.new, method="manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering on the scaled data
km <- kmeans(boston_scaled.new, centers = 3)
# plot the scaled Boston dataset with the clusters
pairs(boston_scaled.new, col = km$cluster)
pairs(boston_scaled.new[, 6:10], col = km$cluster)
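The choice of three clusters can be backed up with the standard elbow heuristic: run k-means for a range of k and look at where the total within-cluster sum of squares (WCSS) stops dropping sharply. A minimal sketch on the same scaled data; the range 1 to 10 is an arbitrary illustration:

```r
set.seed(123)  # arbitrary example seed for reproducible cluster assignments
k_max <- 10
# total within-cluster sum of squares for k = 1..10
twcss <- sapply(1:k_max, function(k) kmeans(boston_scaled.new, k)$tot.withinss)
# the "elbow" in this curve suggests a suitable number of clusters
plot(1:k_max, twcss, type = "b", xlab = "number of clusters", ylab = "total WCSS")
```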